# Prosper Loan Data Exploration

The project is aimed at doing exploratory data analysis on Prosper Loan Data . This is a financial data related to loan browers , lenders , interest rate many more. Prosper.com is one of the leading peer to peer lending sites on the web today. ===============================================================================

## 'data.frame':    113937 obs. of  81 variables:
##  $ ListingKey                         : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
##  $ ListingNumber                      : int  193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
##  $ ListingCreationDate                : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
##  $ CreditGrade                        : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
##  $ Term                               : int  36 36 36 36 36 60 36 36 36 36 ...
##  $ LoanStatus                         : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
##  $ ClosedDate                         : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
##  $ BorrowerAPR                        : num  0.165 0.12 0.283 0.125 0.246 ...
##  $ BorrowerRate                       : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ LenderYield                        : num  0.138 0.082 0.24 0.0874 0.1985 ...
##  $ EstimatedEffectiveYield            : num  NA 0.0796 NA 0.0849 0.1832 ...
##  $ EstimatedLoss                      : num  NA 0.0249 NA 0.0249 0.0925 ...
##  $ EstimatedReturn                    : num  NA 0.0547 NA 0.06 0.0907 ...
##  $ ProsperRating..numeric.            : int  NA 6 NA 6 3 5 2 4 7 7 ...
##  $ ProsperRating..Alpha.              : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
##  $ ProsperScore                       : num  NA 7 NA 9 4 10 2 4 9 11 ...
##  $ ListingCategory..numeric.          : int  0 2 0 16 2 1 1 2 7 7 ...
##  $ BorrowerState                      : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
##  $ Occupation                         : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
##  $ EmploymentStatus                   : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
##  $ EmploymentStatusDuration           : int  2 44 NA 113 44 82 172 103 269 269 ...
##  $ IsBorrowerHomeowner                : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
##  $ CurrentlyInGroup                   : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
##  $ GroupKey                           : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 335 1 1 1 1 1 1 1 ...
##  $ DateCreditPulled                   : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
##  $ CreditScoreRangeLower              : int  640 680 480 800 680 740 680 700 820 820 ...
##  $ CreditScoreRangeUpper              : int  659 699 499 819 699 759 699 719 839 839 ...
##  $ FirstRecordedCreditLine            : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
##  $ CurrentCreditLines                 : int  5 14 NA 5 19 21 10 6 17 17 ...
##  $ OpenCreditLines                    : int  4 14 NA 5 19 17 7 6 16 16 ...
##  $ TotalCreditLinespast7years         : int  12 29 3 29 49 49 20 10 32 32 ...
##  $ OpenRevolvingAccounts              : int  1 13 0 7 6 13 6 5 12 12 ...
##  $ OpenRevolvingMonthlyPayment        : num  24 389 0 115 220 1410 214 101 219 219 ...
##  $ InquiriesLast6Months               : int  3 3 0 0 1 0 0 3 1 1 ...
##  $ TotalInquiries                     : num  3 5 1 1 9 2 0 16 6 6 ...
##  $ CurrentDelinquencies               : int  2 0 1 4 0 0 0 0 0 0 ...
##  $ AmountDelinquent                   : num  472 0 NA 10056 0 ...
##  $ DelinquenciesLast7Years            : int  4 0 0 14 0 0 0 0 0 0 ...
##  $ PublicRecordsLast10Years           : int  0 1 0 0 0 0 0 1 0 0 ...
##  $ PublicRecordsLast12Months          : int  0 0 NA 0 0 0 0 0 0 0 ...
##  $ RevolvingCreditBalance             : num  0 3989 NA 1444 6193 ...
##  $ BankcardUtilization                : num  0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
##  $ AvailableBankcardCredit            : num  1500 10266 NA 30754 695 ...
##  $ TotalTrades                        : num  11 29 NA 26 39 47 16 10 29 29 ...
##  $ TradesNeverDelinquent..percentage. : num  0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
##  $ TradesOpenedLast6Months            : num  0 2 NA 0 2 0 0 0 1 1 ...
##  $ DebtToIncomeRatio                  : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
##  $ IncomeRange                        : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
##  $ IncomeVerifiable                   : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
##  $ StatedMonthlyIncome                : num  3083 6125 2083 2875 9583 ...
##  $ LoanKey                            : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
##  $ TotalProsperLoans                  : int  NA NA NA NA 1 NA NA NA NA NA ...
##  $ TotalProsperPaymentsBilled         : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ OnTimeProsperPayments              : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ ProsperPaymentsLessThanOneMonthLate: int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPaymentsOneMonthPlusLate    : int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPrincipalBorrowed           : num  NA NA NA NA 11000 NA NA NA NA NA ...
##  $ ProsperPrincipalOutstanding        : num  NA NA NA NA 9948 ...
##  $ ScorexChangeAtTimeOfListing        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanCurrentDaysDelinquent          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ LoanFirstDefaultedCycleNumber      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanMonthsSinceOrigination         : int  78 0 86 16 6 3 11 10 3 3 ...
##  $ LoanNumber                         : int  19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
##  $ LoanOriginalAmount                 : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ LoanOriginationDate                : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
##  $ LoanOriginationQuarter             : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
##  $ MemberKey                          : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
##  $ MonthlyLoanPayment                 : num  330 319 123 321 564 ...
##  $ LP_CustomerPayments                : num  11396 0 4187 5143 2820 ...
##  $ LP_CustomerPrincipalPayments       : num  9425 0 3001 4091 1563 ...
##  $ LP_InterestandFees                 : num  1971 0 1186 1052 1257 ...
##  $ LP_ServiceFees                     : num  -133.2 0 -24.2 -108 -60.3 ...
##  $ LP_CollectionFees                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_GrossPrincipalLoss              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NetPrincipalLoss                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NonPrincipalRecoverypayments    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PercentFunded                      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Recommendations                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsCount         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsAmount        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Investors                          : int  258 1 41 158 20 1 1 1 1 1 ...

Univariate Plots

There are around 81 variables. Let us ask some question and explore the dataset for better understanding.

Plot 1 - What is the distribution of loan over years?

Plot 1 - Comments

From the graph above the number of loans borrowed has incresed from 2005 - 2008. Then there is a drop in the number of loans offered. Again the loan number has been increasing from 2010 - 2013. But again there is a dip in 2014.

Plot 2 - What is the range of loan amount being borrowred?

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    6500    8337   12000   35000
ggplot(pd, aes(LoanOriginalAmount)) +
  geom_density(fill = "#45A3E6")+
  geom_vline(xintercept = 8337, size = 1, colour = "#FF3721",
               linetype = "dashed")+
  ggtitle("Loan Amount Distribution")

Plot 2 - Comments

The loan amount is highly skewed to the left, which means to say that most of the loan is for small amount. The maximun loan amount borrowed is 35000 and minimum is 1000. The mean loan amount is 8337.

Plot 3 - What are the differnet loan status values in the data set.Which loan status is hightly prevalent?

##              Cancelled             Chargedoff              Completed 
##                      5                  11992                  38074 
##                Current              Defaulted FinalPaymentInProgress 
##                  56576                   5018                    205 
##   Past Due (>120 days)   Past Due (1-15 days)  Past Due (16-30 days) 
##                     16                    806                    265 
##  Past Due (31-60 days)  Past Due (61-90 days) Past Due (91-120 days) 
##                    363                    313                    304

Plot 3 - Comments

Most of the loans are current. There are also some loans which are defaulted and chargedoff.

Plot 4 -What term are the loans usually taken for?

Plot 4 - Comments

Prosper seems to offer loans for 12 , 36 or 60 months term.The most common being 36 months loan term.

Plot 5 - Why do people take loan?

Plot 5 - Comments

The most common purpose to borrow loan from Prosper is for Debt Consolidation followed by other.

Plot 6 - What is the distribution of credit score both upper range?

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    19.0   679.0   699.0   704.6   739.0   899.0     591

Plot 7 - What is the distribution of credit score both lower range?

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0   660.0   680.0   685.6   720.0   880.0     591

Plot 6 , 7 - Comments

The distribution of Credit score upper and lower range look almost normally distributed. They do have some outliers and NA(non available) data. Credit score is a inflencing factor for getting a better interest rate. Instead of 2 separate variables we can combine them into one and call it CreditScore.

Plot 8 - Combine Upper and Lower Credit Score into one.

summary(pd$CreditScore)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     9.5   669.5   689.5   695.1   729.5   889.5     591

Plot 8 - Comments

The Credit Score follows the same pattern as CreditScore Upper nad Lower Range of normal distribution. The mean Creditscore is 695.1

Plot 9 - How are the borrowers APR and Rate distributed in the dataset?

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## 0.00653 0.15629 0.20976 0.21883 0.28381 0.51229      25
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1340  0.1840  0.1928  0.2500  0.4975

Plot 9 - Comments

The distribution of Borrower APR is almost normal with a mean rate of 0.21883 The distribution of Borrower Interest Rate is almost normal with a mean rate of 0.1928.

Plot 10 - What kind of ProsperScore rating have been given to the borrower?

Plot 10 - Comments

Prosper score is a custom risk score built using historical Prosper data to assess the risk of Prosper borrower listings. The output to Prosper users is a Prosper score which ranges from 1 to 11, with 11 being the best, or lowest risk, score. The worst, or highest risk, score, is a 1. In this dataset prosper Score looks normally distributed. With a mean score of 5.95 .Also to be noted there are a number of rows with data NA(not available) this is because as stated in the variable definition sheet the risk score is applicable only from 2009 onwards.

Plot 11 - What is prosper rating. how is it distribution in thid dataset?

Plot 11 - Comments

The Prosper Rating is proprietary system that allows prosper to maintain consistency when evaluating each loan application. Prosper Ratings allow investors to consider a loan’s level of risk because the rating represents an estimated average annualized estimated loss rate range.

Plot 12 - What is the emplyoment status of the borrowers like?

Plot 12 - Comments

About 67322 borrowers are employed. But i think part-time,full-time should belong to emplyoed category. It is interesting to know that even “Not employed” people get loan.

Plot 13 - How does the Debt To Income ratio look like for the Prosper Loans?

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.140   0.220   0.276   0.320  10.010    8554

Plot 13 - Comments

The debt to income ration is highly skewed to the left. The mean debttoincome ratio is 0.276.

Plot 14 - What is the distribution of Income among the borrowers?

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3200    4667    5608    6825 1750003

Plot 14 - Comments

From the plot we can observe the income is skewed to the left. Very few borrowers earn more than 10000. The mean monthly income is 5608.

Plot 15 - How many loans originated by each quarter?

## 2005 Q4 2006 Q1 2006 Q2 2006 Q3 2006 Q4 2007 Q1 2007 Q2 2007 Q3 2007 Q4 
##      22     315    1254    1934    2403    3079    3118    2671    2592 
## 2008 Q1 2008 Q2 2008 Q3 2008 Q4 2009 Q2 2009 Q3 2009 Q4 2010 Q1 2010 Q2 
##    3074    4344    3602     532      13     585    1449    1243    1539 
## 2010 Q3 2010 Q4 2011 Q1 2011 Q2 2011 Q3 2011 Q4 2012 Q1 2012 Q2 2012 Q3 
##    1270    1600    1744    2478    3093    3913    4435    5061    5632 
## 2012 Q4 2013 Q1 2013 Q2 2013 Q3 2013 Q4 2014 Q1 
##    4425    3616    7099    9180   14450   12172

Plot 13 - Comments

Looks like proser laon data is from last quarter of 2005 until first quarter of 2014. The number of loans offere by prosper has increase gradually after the correction perios of 2009. https://en.wikipedia.org/wiki/Prosper_Marketplace

Univariate Analysis

What is the structure of your dataset?

The prosper loan data set consist of 113,937 loans and 81 variables. Most data is recorded between 2005 to 2014. Some of the variables include information about the loan such as the amount, rate, length, and status, and also information on the borrower, such as credit grade, income range, and employment status and so on.

What is/are the main feature(s) of interest in your dataset?

The dataset has 81 variables , in this section i am interested to understand the variable , look at their distribution and try to find some insights. For univariate analysis i have focused on the follwoing variables. ListingCreationDate Term ListingCategory..numeric. ProsperScore CreditScoreRangeUpper CreditScoreRangeLower LoanStatus LoanOriginalAmount BorrowerAPR BorrowerRate ProsperScore ProsperRating EmploymentStatus

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Based on my research about Prosper score i found that variables may also be interesting to anayse. CurrentDelinquencies StatedMonthlyIncome DebtToIncomeRatio https://www.prosper.com/plp/general-prosper_score/

Did you create any new variables from existing variables in the dataset?

Yes i created one variable called CreditScore by takin an average of CreditScoreRangeUpper and CreditScoreRangeLower. I also created variable called LoanOriginationQuarter using the zoo library.

Of the features you investigated, were there any unusual distributions?

When analysisng the distribution of no of loans vsr years , a dip was observed around 2008-2009.

Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The debt to income ratio was cut into bands to be able to analyse borrowers by debt to income ratio group.

Bivariate Plots

Plot 16 - How does a credit score affect the prosper rating?

## pd$ProsperRating..Alpha.: 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     9.5   609.5   649.5   653.9   709.5   889.5     591 
## -------------------------------------------------------- 
## pd$ProsperRating..Alpha.: A
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   649.5   709.5   729.5   739.4   769.5   889.5 
## -------------------------------------------------------- 
## pd$ProsperRating..Alpha.: AA
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   689.5   749.5   789.5   783.6   809.5   889.5 
## -------------------------------------------------------- 
## pd$ProsperRating..Alpha.: B
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   609.5   689.5   709.5   716.4   749.5   869.5 
## -------------------------------------------------------- 
## pd$ProsperRating..Alpha.: C
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   609.5   669.5   689.5   699.4   729.5   889.5 
## -------------------------------------------------------- 
## pd$ProsperRating..Alpha.: D
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   609.5   669.5   689.5   689.8   709.5   869.5 
## -------------------------------------------------------- 
## pd$ProsperRating..Alpha.: E
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   609.5   649.5   669.5   672.0   689.5   869.5 
## -------------------------------------------------------- 
## pd$ProsperRating..Alpha.: HR
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   609.5   669.5   689.5   686.5   709.5   869.5

Plot 16 - Comments

There is a clear indication as the credit score increase the Prosper rating also increases. As the credit score increases the Prosper Rating falls under AA catergory which stands for lower risk.

Plot 17 - How does credit score affect the Borrowers interest rate?

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     9.5   669.5   689.5   695.1   729.5   889.5     591

Plot 17 - Comments

Borrowers APR and Credit Score have a negative corelation. The Borrower APR increases with increase in Credit Score.

Plot 18 - How does the prosper score change based on Loan Amount.

Plot 18 - Comments

It is interesting to know that the prosper score is high for higher loan amount.

Plot 19 -Is the Prosperscore better for higher stated monthly income?

Plot 19 - Comments

The monthly income is a very important factor which inflence the prosper score. From the plot above we can see prosper score increase with larger stated monthly income.

Plot 20 - Does DebtToIncomeRation affect the BorrowerRate

Plot 20 - Comments

The debt-to-income ratio is one way lenders, including mortgage lenders, measure an individual’s ability to manage monthly payment and repay debts. So it is necssary to have a low debt to income ratio for better rate and the graph also shows us the borrower rate better rate when the debtTo income ratio is low.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Some of the key observations are 1) Credit score increase the Prosper rating also increases. As the credit score increases the Prosper Rating falls under AA catergory which stands for lower risk. 2) Prosper score is high for larger loan amount.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

  1. Borrower rate better rate when the debtTo income ratio is low.
  2. Larger the stated monthly income btter is the prosper rate.

What was the strongest relationship you found?

Borrowers APR and Credit Score have a negative corelation. The Borrower APR increases with increase in Credit Score.

Multivariate Plots

Plot 21 - How is prosper loan and APR distributed foe diffent emplyoment status?

Plot 21 - Comments

In above plot it can be clearly seen Not Employed and Retired APR has been steady and higher for prosper scores below six.

Plot 22 - How does the borrower rate with credit score and prosper score?

## $title
## [1] " BorrowerAPR VS Credit Score with colored Prosper Rating"
## 
## $subtitle
## NULL
## 
## attr(,"class")
## [1] "labels"

Plot 22 - Comments

The loans high prosper score and credit score had a lower Borrower Rate. From the above plot we can see the Prosper score of 7.5 and above and a credit score of 700 and above had a low borrower rate.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

The multivarite plots futher prove that prosper score and credit rate have a very high inflence on the borrowers rate.

Were there any interesting or surprising interactions between features?

It is observed that empllyoment status had an influence on APR and credit score. Not Employed and Retired borrowers APR has been higher for prosper scores below six. It is interesting to note that self emplyoed managed to get better APR and prosper score.

Final Plots

Plot 1

Plot 1 - Comments

In this plot it can be observed ProsperRating Increases with increase in the credit score.

Plot 2

Plot 2 - Comments

The plot shows emplyoed borrower get higher prosperscore and also better APR . It can also be seen self-emplyoed borrowes also seem to get almost the same kind of prosper rating and APR value. A combination of other vaiables help to decide on the prosper score and APR.

Plot 3

Plot 3 - Comments

The plot above shows the distribution of loans over a time period of 2005- 2014.Most importantly it the trend the loans follow before and after 2009 Q2. We can observe for a given year the loan amount increase from first quarter to last quarter.

Reflection

About the Data

This data set was a challenge mainly because of the number of variables. It gave a good experience on how real wrold data is and since my knowledge on loans was limited i had to spend some time understanding the terms and also make decisions on what kind of questions to ask to explore this Prosper loan data.

Challenges

Working with ggplot2 to explore th variables was not easy. I had to iterate the code many times to get a satisfactory representations. The main area of difficulty is choose the correct scale so the graph can represent the data better. Another chanllenge was choosing the right graph type since the dataset has both categorical and quantitative values.

While testing the bivariate and multivarite plots some of the relationships was not as i expected for example - Borrowers APR and Credit Score has a negative corelation. The Borrower APR increases with increase in Credit Score. Atleast to my understanding i thought the borrower can get a lower APR with a good credit score rating.

Findings

I did manage to observe some interesting factors about indidvidual variables expecially i would like to mention the distribution of loan across the quarters gives a good overview of Prosper loan performance. This promted me to go look for the reason why there was a dip in loan sales during 2009. I also managed to find some key variable which influence the interest rate. I found the Prosper Score and Prosper Rating are key variables which affect the borrowers APR and interest rate to name some.

Future improvement

After doing a lot of reading i understood there are many interesting questions to be askes and explored. This is just a small analysis presented here for the project requirements, I wish to further continue and explore the dataset especially i want to see distribution of loan across different states, which state perform well in paying the laon on time. Also looking at what king of loans investors prefer to invest in and what a re the key factors whaich help them in this process may be a good area of investigation moving forward.